Major credit for the ideas goes to this notebook: https://www.kaggle.com/code/stassl/recovering-time-id-order. In this notebook I provide my own inference in an attempt to recover the time order.
#Installing and Importing Libraries
!pip install yfinance
import yfinance as yf
!pip install umap-learn
import umap
import pandas as pd
import numpy as np
# import yfinance as yf
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from glob import glob
from joblib import Parallel, delayed
from sklearn.manifold import TSNE, SpectralEmbedding
from sklearn.preprocessing import minmax_scale
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from mpl_toolkits.axes_grid1 import make_axes_locatable
%config InlineBackend.figure_format = 'retina'
sns.set_theme('notebook', 'white', font_scale=1.2, palette='tab10')
# Loading the Kaggle dataset (order book file paths and training targets)
data_dir = 'data'
df_files = pd.DataFrame({'book_path': glob(f'{data_dir}/book_train.parquet/**/*.parquet')}) \
    .assign(stock_id=lambda x: x.book_path.str.extract(r"stock_id=(\d+)").astype('int')) \
    .sort_values('stock_id')
df_target_train = pd.read_csv(f'{data_dir}/train.csv')
df_volatility_train = df_target_train.groupby('time_id').target.mean()
#Defining functions for EDA
def rmspe(y_true, y_pred):
    return (((y_true - y_pred) / y_true) ** 2).mean() ** 0.5
def plot_price(stock_id, time_id, price_name, kind, ax):
    # Load one price column of the order book for the chosen stock and time bucket
    r = df_files.query(f'stock_id == {stock_id}').iloc[0]
    df = pd.read_parquet(r.book_path, columns=['time_id', 'seconds_in_bucket', price_name])
    df = df.query(f'time_id == {time_id}').drop(columns='time_id').set_index('seconds_in_bucket').reindex(np.arange(600), method='ffill')
    # Smallest positive price move, taken as one tick in normalised units
    min_diff = np.nanmin(abs(df[price_name].diff().where(lambda x: x > 0)))
    if kind == 'price_norm':
        df[price_name].plot.line(legend=False, ax=ax)
        ax.set_title(f'stock_id={stock_id}, time_id={time_id}: {price_name} normalized')
    elif kind == 'price_change':
        df = df[price_name].diff().reset_index()
        df.plot.bar(x='seconds_in_bucket', y=price_name, color=np.where(df[price_name] > 0, 'g', 'r'), legend=False, edgecolor='none', width=1, ax=ax)
        ax.set_title(f'stock_id={stock_id}, time_id={time_id}: {price_name} change')
        ax.yaxis.set_major_locator(mpl.ticker.MultipleLocator(min_diff))
    elif kind == 'ticks_change':
        df = df[price_name].diff().div(min_diff).reset_index()
        df.plot.bar(x='seconds_in_bucket', y=price_name, color=np.where(df[price_name] > 0, 'g', 'r'), legend=False, edgecolor='none', width=1, ax=ax)
        ax.set_title(f'stock_id={stock_id}, time_id={time_id}: {price_name} change (ticks)')
        ax.yaxis.set_major_locator(mpl.ticker.MultipleLocator(1))
    elif kind == 'price_original':
        # One tick is assumed to be 0.01 in real units, so rescale by 0.01 / min_diff
        df[price_name] = 0.01 / min_diff * df[price_name]
        df[price_name].plot.line(legend=False, ax=ax)
        ax.set_title(f'stock_id={stock_id}, time_id={time_id}: {price_name} original')
    ax.xaxis.set_major_locator(mpl.ticker.MultipleLocator(30))
    ax.xaxis.set_tick_params(rotation=0)
    ax.set_axisbelow(True)
    ax.grid(axis='y', linestyle='--')
    ax.set_xlim(0, 600)
def plot_emb(emb, color, name, kind='volatility', fig=None, ax=None):
    if fig is None or ax is None:
        fig, ax = plt.subplots(figsize=(7, 7))
    if kind == 'volatility':
        norm = mpl.colors.LogNorm()
        ticks = mpl.ticker.LogLocator(2)
        formatter = mpl.ticker.ScalarFormatter()
    elif kind == 'date':
        norm = None
        ticks = None
        formatter = mpl.dates.AutoDateFormatter(mpl.dates.MonthLocator())
    plot = ax.scatter(emb[:, 0], emb[:, 1], s=3, c=color, edgecolors='none', cmap='jet', norm=norm)
    divider = make_axes_locatable(ax)
    cax = divider.append_axes('right', size='5%', pad=0.2)
    cb = fig.colorbar(plot, label=kind, format=formatter,
                      ticks=ticks, cax=cax)
    cb.ax.minorticks_off()
    ax.set_title(f'{name}')
To better understand the idea, let's plot some charts for a selected stock_id/time_id pair.
The first graph is the normalised price as it appears in the dataset given to us by Optiver. The second graph is the change between consecutive prices (price.diff()); the changes appear to be integer multiples of a smallest step. The third graph is price.diff() / min(price.diff()), which expresses the price changes in ticks (the minimum price difference). Here we clearly see that the price changes by integer multiples of a tick. The last graph is the restored real price, computed as price * 0.01 / min(price.diff()), on the assumption that one tick is 0.01 in real units.
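To make the restoration formula above concrete, here is a minimal sketch on synthetic numbers (not Optiver data). It assumes that normalisation divides every price by a single reference price and that one tick is $0.01; the variable names and the choice of reference price are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic real price path moving in $0.01 ticks (illustrative values).
real = pd.Series([150.37, 150.38, 150.38, 150.36, 150.39, 150.40, 150.38])

# Assumption: normalisation divides every price by one reference price
# (here simply the first real price, chosen for illustration).
reference = real.iloc[0]
normalised = real / reference

# Smallest positive absolute move in normalised units corresponds to one tick.
diff = normalised.diff().abs()
min_diff = np.nanmin(diff.where(diff > 0))

# One tick is $0.01 in real units, so 0.01 / min_diff is the scaling factor.
scale = 0.01 / min_diff
restored = normalised * scale

# Price changes expressed in whole ticks.
ticks = (normalised.diff() / min_diff).round()
print(round(scale, 2))
print(ticks.dropna().tolist())
```

In this toy case the scaling factor comes out equal to the reference price, which is why multiplying the normalised series by it gives back the real prices.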
# Graphical EDA work
plot_types = ['price_norm', 'price_change', 'ticks_change', 'price_original']
for kind in plot_types:
    fig, ax = plt.subplots(figsize=(13, 4))
    plot_price(89, 103, 'ask_price1', kind, ax)
    if kind == 'price_norm':
        desc = 'original, normalised price of'
    elif kind == 'price_change':
        desc = 'original, normalised price movements of'
    elif kind == 'ticks_change':
        desc = 'price change divided by tick size of'
    elif kind == 'price_original':
        desc = 'restored, real price of'
    if kind == 'price_norm':
        y = 'Normalised price'
    elif kind == 'price_change':
        y = 'Normalised price movements'
    elif kind == 'ticks_change':
        y = 'Price change divided by tick size'
    elif kind == 'price_original':
        y = 'Real price'
    title = f"A Graph Showing the evolution of the {desc} Stock 89 across the time interval 103"
    ax.set_title(title)
    ax.set_xlabel('Seconds within the interval 103')
    ax.set_ylabel(y)
    plt.tight_layout()
    fig.savefig(f'stock_price_{kind}.png', dpi=300, bbox_inches='tight')
# Calculating the real prices
def calc_price(df):
    # Absolute price moves; the smallest positive one is taken as one tick
    diff = abs(df.diff())
    min_diff = np.nanmin(diff.where(lambda x: x > 0))
    # Number of ticks in each move, then the average size of one tick
    n_ticks = (diff / min_diff).round()
    return 0.01 / np.nanmean(diff / n_ticks)
def calc_prices(r):
    # One real price estimate per time_id for the stock in row r
    df = pd.read_parquet(r.book_path, columns=['time_id', 'ask_price1', 'ask_price2', 'bid_price1', 'bid_price2'])
    df = df.groupby('time_id').apply(calc_price).to_frame('price').reset_index()
    df['stock_id'] = r.stock_id
    return df
df_prices_denorm = pd.concat(Parallel(n_jobs=-1, verbose=0)(delayed(calc_prices)(r) for _, r in df_files.iterrows()))
df_prices_denorm = df_prices_denorm.pivot(index='time_id', columns='stock_id', values='price')
df_prices_denorm
| time_id / stock_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 115 | 116 | 118 | 119 | 120 | 122 | 123 | 124 | 125 | 126 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 193.382499 | 152.416327 | 123.461428 | 226.012232 | 619.198910 | 738.256609 | 370.275801 | 245.870983 | 283.881404 | 238.394801 | ... | 88.143530 | 246.326035 | 210.790493 | 66.225351 | 96.217049 | 142.352639 | 108.310134 | 84.344766 | 53.375172 | 310.446018 |
| 11 | 199.230489 | 149.512019 | 128.641219 | 249.893186 | 614.775587 | 769.481159 | 411.690103 | 256.711224 | 278.116800 | 244.391095 | ... | 90.285607 | 275.920002 | 213.987639 | 63.064600 | 101.239489 | 136.924692 | 105.482065 | 91.095218 | 55.886795 | 300.948142 |
| 16 | 208.900108 | 104.885672 | 118.687626 | 164.755260 | 534.006468 | 584.016561 | 236.595134 | 208.127709 | 123.656642 | 177.405905 | ... | 70.035150 | 189.724268 | 281.007926 | 53.520546 | 74.554427 | 93.370637 | 77.421019 | 55.054433 | 53.084003 | 194.495613 |
| 31 | 216.138269 | 137.831207 | 138.326846 | 235.951400 | 657.637025 | 804.561657 | 358.949356 | 260.607163 | 194.206656 | 232.626591 | ... | 90.688217 | 256.869865 | 235.129520 | 61.245171 | 90.928457 | 134.847296 | 103.739264 | 92.465027 | 58.124482 | 259.149062 |
| 62 | 214.516335 | 140.650237 | 137.213402 | 238.242521 | 639.647230 | 765.430361 | 403.068356 | 250.621870 | 187.756607 | 243.460002 | ... | 89.140911 | 248.061738 | 231.193774 | 59.542960 | 87.940275 | 125.482870 | 103.529649 | 96.060419 | 57.373160 | 259.214214 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32751 | 192.207769 | 148.865368 | 132.110590 | 243.722214 | 652.082489 | 745.338533 | 389.702447 | 252.020435 | 252.114143 | 230.046724 | ... | 88.523948 | 269.021445 | 241.987520 | 60.049757 | 93.429990 | 135.630818 | 103.531318 | 94.184826 | 56.814400 | 306.255498 |
| 32753 | 199.748994 | 143.562752 | 128.829956 | 245.019729 | 616.837441 | 755.527664 | 389.036260 | 249.891005 | 258.463170 | 241.465759 | ... | 91.186063 | 269.457870 | 212.935865 | 62.066114 | 92.794336 | 136.533152 | 104.054983 | 87.874304 | 55.995629 | 290.136260 |
| 32758 | 198.471328 | 111.979022 | 142.683929 | 216.825656 | 591.479678 | 729.433334 | 509.977233 | 234.174952 | 132.590478 | 206.086532 | ... | 80.310357 | 225.500215 | 211.873457 | 43.955594 | 80.582209 | 111.225451 | 89.826475 | 81.064096 | 56.597615 | 202.570521 |
| 32763 | 208.002502 | 81.295087 | 116.000506 | 107.229172 | 516.903933 | 502.602869 | 143.001750 | 182.616348 | 94.160764 | 162.246740 | ... | 59.235750 | 151.293921 | 263.571506 | 44.825190 | 66.755134 | 77.321223 | 71.994967 | 52.124935 | 50.174789 | 151.879455 |
| 32767 | 208.721071 | 101.052147 | 128.278257 | 199.130195 | 564.778295 | 683.743827 | 254.351346 | 245.839823 | 116.209637 | 185.205086 | ... | 76.478989 | 196.456837 | 280.672331 | 51.486737 | 81.482505 | 105.091738 | 75.074348 | 68.755608 | 55.856964 | 194.435959 |
3830 rows × 112 columns
Now we have the real price of each stock. The plot below shows the price distribution for each stock. From it we are able to identify stock 61 as AMZN, the most expensive stock over the period covered by the dataset (January 1, 2020, until March 31, 2021).
plt.figure(figsize=(15, 20))
ax = sns.stripplot(data=df_prices_denorm, orient='h', alpha=0.3, s=2, jitter=0.2,
order=df_prices_denorm.median().sort_values().index[::-1].tolist(),
palette='Spectral')
ax.tick_params(axis='y', which='major', labelsize=10)
plt.xlabel('Real Price')
plt.ylabel('Stock')
plt.title('Real Price distribution by stock');
plt.savefig(f'Stock Price Distribution.png', dpi=300, bbox_inches='tight')
# Downloading actual stock price data from Yahoo Finance from 1 Jan 2020 up to (but not including) 1 June 2021
SP100_tickers = pd.read_html('https://en.wikipedia.org/wiki/S%26P_100')[2].Symbol
SP100_tickers = SP100_tickers[SP100_tickers != 'BRK.B']
df_prices_real = yf.download(SP100_tickers.to_list(), start='2020-01-01', end='2021-06-01', interval='1d')
#Garman-Klass Estimator for RV
df_volatility_real = 1 / 2 * np.log(df_prices_real.High / df_prices_real.Low) ** 2 - \
(2 * np.log(2) - 1) * np.log(df_prices_real.Close / df_prices_real.Open) ** 2
df_volatility_real = df_volatility_real.mean(axis=1)
df_prices_real = df_prices_real.Open.fillna(df_prices_real.Open.mean()).dropna(axis=1).sample(frac=1)
df_volatility_real = df_volatility_real.loc[df_prices_real.index]
df_prices_real
| Date / Ticker | AAPL | ABBV | ABT | ACN | ADBE | AIG | AMD | AMGN | AMT | AMZN | ... | TXN | UNH | UNP | UPS | USB | V | VZ | WFC | WMT | XOM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-09-28 | 112.146101 | 72.656763 | 96.215211 | 206.031476 | 487.970001 | 24.883914 | 79.120003 | 215.274791 | 216.043833 | 157.442505 | ... | 123.786873 | 287.101634 | 182.678577 | 144.752861 | 29.263500 | 194.239424 | 45.870462 | 21.638643 | 42.850523 | 28.815179 |
| 2021-02-02 | 132.578491 | 87.814601 | 114.740217 | 236.826267 | 473.649994 | 34.545483 | 88.489998 | 210.016600 | 211.005420 | 169.000000 | ... | 154.799905 | 317.014949 | 182.581658 | 139.884200 | 36.050885 | 194.745388 | 42.819425 | 27.600633 | 43.822402 | 38.470995 |
| 2021-05-06 | 125.107203 | 99.353265 | 110.168682 | 275.074599 | 485.670013 | 44.735769 | 77.629997 | 219.759237 | 219.457561 | 163.500000 | ... | 163.641660 | 389.772416 | 205.321075 | 185.381485 | 50.976233 | 222.996256 | 46.842384 | 42.005177 | 44.254041 | 51.977871 |
| 2020-08-19 | 113.094409 | 80.830539 | 93.298483 | 219.864733 | 464.290009 | 26.245686 | 81.779999 | 209.486154 | 223.987728 | 165.150497 | ... | 123.301356 | 295.610177 | 172.786680 | 135.912857 | 29.638672 | 192.583879 | 45.716426 | 21.656681 | 41.928911 | 34.727450 |
| 2020-02-14 | 78.807583 | 77.159163 | 81.368586 | 198.910063 | 376.279999 | 43.363203 | 55.189999 | 191.179636 | 225.433346 | 107.783997 | ... | 116.123713 | 279.602173 | 164.514862 | 87.093330 | 43.981902 | 201.014410 | 44.230394 | 42.382478 | 36.290171 | 48.238263 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-05-12 | 120.919940 | 99.223676 | 109.347431 | 271.036043 | 477.190002 | 46.570846 | 75.089996 | 221.636926 | 221.092676 | 159.250000 | ... | 161.626835 | 386.271081 | 206.281486 | 182.862801 | 51.227971 | 217.169636 | 46.317589 | 42.661088 | 43.896849 | 51.998558 |
| 2020-10-08 | 113.355227 | 72.640115 | 99.814973 | 212.567221 | 499.049988 | 26.818129 | 88.110001 | 216.656784 | 216.657084 | 161.249496 | ... | 129.622223 | 303.068998 | 185.669032 | 149.557778 | 32.182572 | 197.502025 | 46.016362 | 22.441408 | 44.209512 | 27.781969 |
| 2020-04-21 | 67.047369 | 67.849945 | 88.358838 | 160.052191 | 340.899994 | 19.815247 | 56.900002 | 201.530597 | 215.990075 | 120.830498 | ... | 95.870780 | 255.483331 | 127.779734 | 84.077477 | 26.601109 | 156.341307 | 43.708375 | 23.682775 | 40.291904 | 31.869601 |
| 2020-12-10 | 117.702124 | 92.013437 | 98.869652 | 232.120649 | 483.739990 | 35.337674 | 89.550003 | 200.933355 | 195.725564 | 154.449493 | ... | 144.184045 | 324.594645 | 186.615000 | 143.184163 | 37.579342 | 202.172582 | 47.682052 | 26.023973 | 46.327693 | 36.279134 |
| 2021-03-29 | 119.002971 | 90.272865 | 112.508635 | 263.559182 | 469.029999 | 42.115539 | 77.029999 | 221.742759 | 212.319289 | 152.772003 | ... | 165.355487 | 355.193051 | 203.784481 | 144.573758 | 46.269056 | 206.312578 | 45.698707 | 35.069023 | 42.506613 | 48.971598 |
355 rows × 100 columns
Note: these prices are adjusted for stock splits. For example, AMZN will have to be multiplied by 20 to get the real, unadjusted price for the graphical comparison at the end.
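The Garman-Klass estimate computed above from the Yahoo Finance data can also be written as a small standalone function. The sketch below uses the same expression as the notebook code, 0.5 * ln(High/Low)^2 - (2 ln 2 - 1) * ln(Close/Open)^2, on hypothetical OHLC values; the function name and the numbers are illustrative only.

```python
import numpy as np

def garman_klass(open_, high, low, close):
    """Garman-Klass daily variance estimate from OHLC prices."""
    hl = np.log(high / low)       # log range of the day
    co = np.log(close / open_)    # log open-to-close return
    return 0.5 * hl ** 2 - (2 * np.log(2) - 1) * co ** 2

# Hypothetical OHLC values for a quiet day and a turbulent day.
open_ = np.array([100.0, 100.0])
high = np.array([102.0, 110.0])
low = np.array([99.0, 92.0])
close = np.array([101.0, 95.0])

gk = garman_klass(open_, high, low, close)
print(gk)
```

The day with the wider high-low range gets the larger estimate. In the notebook the per-ticker values are then averaged across tickers to give one volatility figure per date.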
df_prices_denorm_scaled = df_prices_denorm.fillna(df_prices_denorm.mean())
df_prices_denorm_scaled = pd.DataFrame(minmax_scale(df_prices_denorm_scaled), index=df_prices_denorm.index)
df_prices_real_scaled = df_prices_real.fillna(df_prices_real.mean())
df_prices_real_scaled = pd.DataFrame(minmax_scale(df_prices_real_scaled), index=df_prices_real.index)
Now that we have the real prices, we can use them to recover the chronological order.
As there are 112 different stocks, each time_id can be modelled as a point in a 112-dimensional space. I hypothesize that, although the time_ids are shuffled, the underlying prices evolve continuously, so time_ids that are neighbours in time should also be close in this space. This means that even though the time_ids live in a 112-dimensional space, they might be representable by a smooth, 1-dimensional curve.
Therefore, below I apply dimensionality reduction methods: PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbour Embedding), UMAP (Uniform Manifold Approximation and Projection), and Spectral Embedding.
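The hypothesis above can be checked on synthetic data where the true order is known. The sketch below is only an illustration under stated assumptions: the "prices" are independent random walks rather than real stocks, and n_neighbors=15 is an arbitrary choice. It shuffles the rows, embeds them with SpectralEmbedding, and compares the rank along the first coordinate with the true day.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.preprocessing import minmax_scale

rng = np.random.default_rng(0)

# Synthetic stand-in for the price matrix: 300 "days" x 30 "stocks",
# each stock following an independent random walk (an assumption).
n_days, n_stocks = 300, 30
prices = 100 + np.cumsum(rng.normal(size=(n_days, n_stocks)), axis=0)

# Shuffle the rows to mimic time_ids that are given in random order.
true_day = rng.permutation(n_days)
shuffled = minmax_scale(prices[true_day])

# First spectral coordinate of the shuffled rows.
embedding = SpectralEmbedding(n_components=2, n_neighbors=15, random_state=0)
coord = embedding.fit_transform(shuffled)[:, 0]

# Rank of each row along the recovered curve, compared with its true day.
recovered_rank = np.argsort(np.argsort(coord))
rank_corr = np.corrcoef(recovered_rank, true_day)[0, 1]
print(abs(rank_corr))
```

The absolute value is used because the embedding can come out in either direction; only the ordering along the curve matters at this stage.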
f, ax = plt.subplots(1, 2, figsize=(14, 6))
emb = PCA(n_components=2)
emb_denorm = emb.fit_transform(df_prices_denorm_scaled)
emb_real = emb.fit_transform(df_prices_real_scaled)
plot_emb(emb_denorm, df_volatility_train, 'Denormalised Prices', 'volatility', f, ax[0])
plot_emb(emb_real, df_volatility_real, 'Real Prices', 'volatility', f, ax[1])
f.suptitle('PCA Embeddings')
plt.tight_layout()
f, ax = plt.subplots(1, 2, figsize=(14, 6))
emb = TSNE(n_components=2, perplexity=40, learning_rate=50,
verbose=1, init='pca', n_iter=2000,
early_exaggeration=12)
emb_denorm = emb.fit_transform(df_prices_denorm_scaled)
emb_real = emb.fit_transform(df_prices_real_scaled)
plot_emb(emb_denorm, df_volatility_train, 'Denormalised prices', 'volatility', f, ax[0])
plot_emb(emb_real, df_volatility_real, 'Real prices', 'volatility', f, ax[1])
f.suptitle('TSNE embeddings')
plt.tight_layout()
f, ax = plt.subplots(1, 2, figsize=(14, 6))
emb = umap.UMAP(n_neighbors=60, min_dist=0.1, target_metric='euclidean',
init='spectral', low_memory=False, verbose=True,
spread=0.5, local_connectivity=1, repulsion_strength=1,
negative_sample_rate=5)
emb_denorm = emb.fit_transform(df_prices_denorm_scaled)
emb_real = emb.fit_transform(df_prices_real_scaled)
plot_emb(emb_denorm, df_volatility_train, 'Denormalised prices', 'volatility', f, ax[0])
plot_emb(emb_real, df_volatility_real, 'Real prices', 'volatility', f, ax[1])
f.suptitle('UMAP embeddings')
plt.tight_layout()
f, ax = plt.subplots(1, 2, figsize=(14, 6))
emb = SpectralEmbedding(random_state=2)
emb_denorm = emb.fit_transform(df_prices_denorm_scaled)
emb_real = emb.fit_transform(df_prices_real_scaled)
plot_emb(emb_denorm, df_volatility_train, 'Denormalised prices', 'volatility', f, ax[0])
plot_emb(emb_real, df_volatility_real, 'Real prices', 'volatility', f, ax[1])
f.suptitle('Spectral embeddings')
plt.tight_layout()
plt.savefig(f'Spectral Embeddings.png', dpi=300, bbox_inches='tight')
With spectral embedding, I could observe the 1-dimensional manifold as hoped, with strong similarities between the embedding of the reconstructed prices and that of the real prices, which suggests that the price denormalisation is working correctly. Furthermore, we can also observe a single outlying cluster of high volatility, which corresponds to the 2020 stock market crash caused by Covid-19.
To check the hypothesis that the spectral embedding sorts the data by date, the plot below shows the same embedding of the real prices, colour-coded by date, which is known from the Yahoo Finance dataset.
plot_emb(emb_real, [mpl.dates.date2num(i) for i in df_volatility_real.index], 'real prices', 'date')
plt.savefig(f'Real Prices Date.png', dpi=300, bbox_inches='tight')
Quite clearly, one observes that the spectral embedding sorts the observations by date. Therefore we are able to use the x-coordinates of the spectral embedding to approximately recover the chronological order of the time_ids.
df_prices_denorm_ordered = df_prices_denorm.iloc[np.argsort(-emb_denorm[:, 0])]
df_prices_denorm_ordered.reset_index(drop=True).rolling(10).mean(). \
plot(subplots=True, layout=(-1, 5), figsize=(15, 60), sharex=True, lw=1)
plt.suptitle('Denormalized prices in recovered time order')
plt.subplots_adjust(top=0.97, wspace=0.3);
One easy observation from this is that the 2020 stock market crash caused by Covid-19 shows up consistently across the stock prices.
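The ordering step above sorts on the negated x-coordinate. The direction of a spectral coordinate carries no meaning by itself, so whether to negate it has to be decided by checking the result against something known, such as where the crash falls. The sketch below shows the two candidate orders on hypothetical toy values; the time_ids, coordinates and the helper recover_order are illustrative only.

```python
import numpy as np

# Hypothetical toy inputs: six time_ids with a 1-D embedding coordinate each.
time_ids = np.array([5, 11, 16, 31, 62, 72])
coord = np.array([0.30, -0.10, 0.45, -0.40, 0.05, 0.20])

def recover_order(ids, coord, descending=True):
    """Return ids sorted along the embedding coordinate."""
    key = -coord if descending else coord
    return ids[np.argsort(key)]

order_desc = recover_order(time_ids, coord)                    # direction used in this notebook
order_asc = recover_order(time_ids, coord, descending=False)   # the mirrored alternative
print(order_desc.tolist())
print(order_asc.tolist())
```

The two results are mirror images of each other, so picking the wrong one only reverses time and is easy to spot in the price plots.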
We also plot real prices in order to compare:
df_prices_real.sort_index().plot(subplots=True, layout=(-1, 5), figsize=(15, 60), sharex=True, lw=1);  # rows were shuffled above, so sort by date first
plt.xticks([])
plt.suptitle('Real prices')
plt.subplots_adjust(top=0.97, wspace=0.3);
We can clearly see that the price of stock 61 evolves very similarly to AMZN. We plot the reconstructed and the real price in one plot to make this easier to see, noting that we need to convert the Yahoo Finance prices because they are adjusted for stock splits (Amazon had a 20:1 split: https://companiesmarketcap.com/gbp/amazon/stock-splits/#google_vignette).
df_prices_real['AMZN'] *= 20 # to allow for stock split
_, ax = plt.subplots(1, 1, figsize=(15, 5))
df_prices_real['AMZN'].sort_index().to_frame().set_index(np.linspace(0, 1, len(df_prices_real))).plot(lw=1, ax=ax)
df_prices_denorm_ordered[61].rolling(10).mean().to_frame().set_index(np.linspace(0.02, 0.86, len(df_prices_denorm_ordered))).plot(lw=1, ax=ax);
We observe that we have approximately reconstructed Amazon's price and the chronological order of the time_ids. We can also compare General Electric's real and reconstructed stock price evolution, to gain more certainty that we have correctly reconstructed the time series. We note the 0.2:1 stock split: https://www.investing.com/equities/general-electric-historical-data-splits
#0.2:1 stock split https://www.investing.com/equities/general-electric-historical-data-splits
df_prices_real['GE']*= 0.2
_, ax = plt.subplots(1, 1, figsize=(15, 5))
df_prices_real['GE'].sort_index().to_frame().set_index(np.linspace(0, 1, len(df_prices_real))).plot(lw=1, ax=ax)
df_prices_denorm_ordered[31].rolling(10).mean().to_frame().set_index(np.linspace(0, 0.88, len(df_prices_denorm_ordered))).plot(lw=1, ax=ax);